Retrieving Historical Manuscripts using Shape
Authors
Abstract
Convenient access to handwritten historical document collections in libraries generally requires an index, which allows one to locate individual text units (pages, sentences, lines) that are relevant to a given query (usually provided as text). Currently, extensive manual labor is used to annotate and organize such collections, because handwriting recognition approaches provide only poor results on old documents. In this work, we present a novel retrieval approach for historical document collections, which does not require recognition. We assume that word images can be described using a vocabulary of discretized word features. From a training set of labeled word images, we extract discrete feature vectors, and estimate the joint probability distribution of features and word labels. For a given feature vector (i.e. a word image), we can then calculate conditional probabilities for all labels in the training vocabulary. Experiments show that this relevance-based language model works very well with a mean average precision of 89% for 4-word queries on a subset of George Washington’s manuscripts. We also show that this approach may be extended to general shapes by using the same model and a similar feature set to retrieve general shapes in two different shape datasets.
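The estimation step described in the abstract can be illustrated with a small Python sketch. It builds a joint distribution over discrete feature tokens and word labels from labeled training images, then normalizes it to get conditional label probabilities for a new feature vector. All names here are hypothetical (`train`, `label_posteriors`, the smoothing constant `alpha`, the toy feature tokens), and the smoothed per-image mixture is an assumption made for illustration; the abstract does not specify the estimator the authors actually use.

```python
from collections import Counter


def train(samples):
    """Store training images as feature counts.

    samples: list of (features, label) pairs, where features is a list of
    discrete feature tokens and label is the transcribed word.
    """
    vocab = sorted({f for feats, _ in samples for f in feats})
    labels = sorted({lab for _, lab in samples})
    return {
        "samples": [(Counter(feats), lab) for feats, lab in samples],
        "vocab": vocab,
        "labels": labels,
    }


def label_posteriors(model, query, alpha=0.1):
    """Return P(label | query features) for every training label.

    Assumed model: the joint P(query, label) is a uniform mixture over
    training images, each contributing a smoothed product of per-feature
    probabilities. alpha is an illustrative additive-smoothing constant.
    """
    v = len(model["vocab"])
    n = len(model["samples"])
    joint = {lab: 0.0 for lab in model["labels"]}
    for counts, lab in model["samples"]:
        total = sum(counts.values())
        p = 1.0 / n  # uniform prior over training images
        for f in query:
            # Counter returns 0 for unseen features, so smoothing keeps p > 0
            p *= (counts[f] + alpha) / (total + alpha * v)
        joint[lab] += p
    z = sum(joint.values())  # P(query), used to condition on the query
    return {lab: p / z for lab, p in joint.items()}


# Toy data: tokens such as "h3" are made-up discretized shape features.
training = [
    (["h3", "w2", "d1"], "the"),
    (["h3", "w2", "d0"], "the"),
    (["h5", "w7", "d2"], "Washington"),
]
model = train(training)
post = label_posteriors(model, ["h3", "w2", "d1"])
best = max(post, key=post.get)
```

For retrieval with a text query, one plausible use of these posteriors is to score every unlabeled word image by the probability it assigns to the query word and rank images by that score.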
Similar resources
A line-based representation for matching words in historical manuscripts
2011 Elsevier B.V. doi:10.1016/j.patrec.2011.02.013. In this study, we propose a new method for retrieving and recognizing words in historical documents. We represent word images with a set of line segments. Then we provide a criterion for word matchin...
Margins are more important than text, Historical values of margins, memorial notes and colophons of Manuscripts in Zoroastrian tradition
In the Zoroastrian tradition, the most important challenge and the most ambiguous issue is the ambiguity of history and the neglect of time and chronology. Perhaps the view that historical time is limited, that the beginning and end of time are clear, and that goodness will eventually prevail is related to the ambiguity of history in the Zoroastrian tradition. Since early times, the Zoroastrian re...
A Statistical Approach to Retrieving Historical Manuscript Images without Recognition
Handwritten historical document collections in libraries and other areas are often of interest to researchers, students or the general public. Convenient access to such corpora generally requires an index, which allows one to locate individual text units (pages, sentences, lines) that are relevant to a given query (usually provided as text). Several solutions are possible: manual annotation (ve...
Computerized Recognition System for Historical Manuscripts
The article describes the process of creating a universal computerized recognition system of historical manuscripts, including historical shorthand records dating back to the 19th and early 20th centuries. We discuss the problem of getting the original graphical representation of symbols from historical manuscripts using a threshold binarization method. We search for a similar graphical represe...
Methods of the Arabic Manuscripts Digitization
The authors acknowledge Saint-Petersburg State University for research grant 2.37.175.2014. Mediaeval Arabic manuscripts are not only valuable artifacts; they are also one of the major sources of scholarly information in the field of Oriental Studies. This paper discusses methods of digitizing Arabic manuscripts. Over the last fifteen years a lot of Arabic manuscri...